We present an algorithm to store binary memories in a Hopfield neural network using minimum 
probability flow, a recent technique to fit parameters in energy-based probabilistic models. In the 
case of memories without noise, our algorithm provably achieves optimal pattern storage (which 
we show is at least one pattern per neuron) and outperforms classical methods both in speed and 
memory recovery. Moreover, when trained on noisy or corrupted versions of a fixed set of binary 
patterns, our algorithm finds networks which correctly store the originals. We also demonstrate this 
finding visually with the unsupervised storage and clean-up of large binary fingerprint images from 
significantly corrupted samples. 



Introduction. In 1982, motivated by the Ising spin
glass model from statistical physics [1], Hopfield introduced
an auto-associative neural network for the storage
and retrieval of binary patterns [2]. Even today, this
model and its various extensions [3, 4] provide a plausible
mechanism for memory formation in the brain. However,
existing techniques for training Hopfield networks suffer 
either from limited pattern capacity or excessive training 
time, and they exhibit poor performance when trained 
on unlabeled, corrupted memories. 

Our main theoretical contributions here are the intro- 
duction of a tractable and neurally-plausible algorithm 
for the optimal storage of patterns in a Hopfield network, 
a proof that the capacity of such a network is at least 
one pattern per neuron, and a novel local learning rule 
for training neural networks. Our approach is inspired by 
minimum probability flow [5], a recent technique for fit- 
ting probabilistic models that avoids computations with a 
partition function, the usually intractable normalization 
constant of a parameterized probability distribution. 

We also present several experimental results. When 
compared with standard techniques for Hopfield pattern 
storage, our method is shown to be superior in efficiency 
and generalization. Another finding is that our algo- 
rithm can store many patterns in a Hopfield network 
from highly corrupted (unlabeled) samples of them. This 
discovery is also corroborated visually by the storage of 
64 × 64 binary images of human fingerprints from highly
corrupted versions, as explained in Fig. 2.

Background. A Hopfield network $\mathcal{H} = (J, \theta)$ on $n$
nodes $\{1, \ldots, n\}$ consists of a symmetric weight matrix
$J = J^T \in \mathbb{R}^{n \times n}$ with zero diagonal and a threshold vector
$\theta = (\theta_1, \ldots, \theta_n)^T \in \mathbb{R}^n$. The possible states of the

FIG. 1. Example Hopfield Network. The figure above
displays a 3-node Hopfield network with weight matrix $J$
and zero threshold vector. Each binary state vector
$x = (x_1, x_2, x_3)^T$ has energy $E_x$ as labeled on the y-axis of the
diagram on the right. Arrows between states represent one
iteration of the network dynamics; i.e., $x_1$, $x_2$, and $x_3$ are
updated by (1) in the order indicated by the clockwise arrow
in the graph on the left. The resulting fixed states of the
network are indicated by filled circles.



network are all length-$n$ binary strings $\{0, 1\}^n$, which we
represent as binary column vectors $x = (x_1, \ldots, x_n)^T$,
each $x_i \in \{0, 1\}$ indicating the state of node $i$. Given
any state $x = (x_1, \ldots, x_n)^T$, an (asynchronous) dynamical
update of $x$ consists of replacing $x_i$ in $x$ (in consecutive
order starting with $i = 1$; see Fig. 1) with the value



$$H(J_i x - \theta_i). \tag{1}$$



Here, $J_i$ is the $i$th row of $J$ and $H$ is the Heaviside function
given by $H(r) = 1$ if $r > 0$ and $H(r) = 0$ if $r \le 0$.
The energy $E_x$ of a binary pattern $x$ in a Hopfield






FIG. 2. Learning memories from corrupted samples.

We stored 80 fingerprints (64 × 64 binary images) in a Hopfield
network with $n = 64^2 = 4096$ nodes by minimizing the MPF
objective (4) over a large set of randomly generated (and unlabeled)
"noisy" versions (each training pattern had a random
subset of 1228 of its bits flipped; e.g., a, e). After training,
all 80 original fingerprints were stored as fixed-points of the
network. a. Sample fingerprint with 30% corruption used
for training. b. Sample fingerprint with 40% corruption. c.
State of the network after one update of the dynamics initialized
at b. d. Converged network dynamics equal to original
fingerprint. e-h. As in a-d, but for a different fingerprint.

network is defined to be

$$E_x(J, \theta) := -\frac{1}{2} x^T J x + \theta^T x = -\sum_{i<j} x_i x_j J_{ij} + \sum_{i=1}^{n} \theta_i x_i, \tag{2}$$

identical to the energy function for an Ising spin glass. In
fact, the dynamics of a Hopfield network can be seen as
0-temperature Gibbs sampling of this energy function. A
fundamental property of Hopfield networks is that asynchronous
dynamical updates do not increase the energy
[2]. Thus, after a finite number of updates, each initial
state $x$ converges to a fixed-point $x^* = (x_1^*, \ldots, x_n^*)^T$ of
the dynamics; that is, $x_i^* = H(J_i x^* - \theta_i)$ for each $i$. See
Fig. 1 for a sample Hopfield network on $n = 3$ nodes.
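
As an illustration, the update (1) and the energy (2) can be sketched in a few lines of Python. The weight matrix below is a hypothetical 3-node example chosen for this sketch (it is not the network of Fig. 1), and the function names are ours.

```python
def heaviside(r):
    """Heaviside function: 1 if r > 0, otherwise 0."""
    return 1 if r > 0 else 0


def energy(J, theta, x):
    """Energy (2): -sum_{i<j} x_i x_j J_ij + sum_i theta_i x_i."""
    n = len(x)
    pairs = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -pairs + sum(t * xi for t, xi in zip(theta, x))


def sweep(J, theta, x):
    """One asynchronous pass: update nodes in consecutive order using (1)."""
    x = list(x)
    n = len(x)
    for i in range(n):
        field = sum(J[i][j] * x[j] for j in range(n)) - theta[i]
        x[i] = heaviside(field)
    return x


def converge(J, theta, x, max_sweeps=1000):
    """Repeat sweeps until the state stops changing (a fixed-point)."""
    x = list(x)
    for _ in range(max_sweeps):
        nxt = sweep(J, theta, x)
        if nxt == x:
            return x
        x = nxt
    raise RuntimeError("no fixed-point reached within max_sweeps")


# Hypothetical symmetric weights with zero diagonal, and zero thresholds.
J = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
theta = [0, 0, 0]

start = [1, 1, 1]
end = converge(J, theta, start)
print(start, energy(J, theta, start), "->", end, energy(J, theta, end))
```

In this toy network the state (1, 1, 0) is already a fixed-point, while (1, 1, 1) moves to a state of lower energy, consistent with the property that asynchronous updates never increase (2).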

Given a binary pattern $x$, the neighborhood $\mathcal{N}(x)$ of
$x$ consists of those binary vectors which are Hamming
distance 1 away from $x$ (i.e., those with exactly one bit
different from $x$). We say that $x$ is a strict local minimum
if every $x' \in \mathcal{N}(x)$ has a strictly larger energy:

$$0 > E_x - E_{x'} = (J_i x - \theta_i)\delta_i, \tag{3}$$

where $\delta_i = 1 - 2x_i$ and $x_i$ is the bit that differs between $x$
and $x'$. It is straightforward to verify that if $x$ is a strict
local minimum, then it is a fixed-point of the dynamics.
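
The identity in (3) means that strict local minimality can be tested from local fields alone, without evaluating any energies. The sketch below checks this numerically against a brute-force comparison of energies; the 3-node weights are a hypothetical example and the function names are ours.

```python
def energy(J, theta, x):
    """Energy (2): -sum_{i<j} x_i x_j J_ij + sum_i theta_i x_i."""
    n = len(x)
    pairs = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -pairs + sum(t * xi for t, xi in zip(theta, x))


def is_strict_local_min(J, theta, x):
    """Check (3): (J_i x - theta_i) * delta_i < 0 for every bit i."""
    n = len(x)
    for i in range(n):
        field = sum(J[i][j] * x[j] for j in range(n)) - theta[i]
        if field * (1 - 2 * x[i]) >= 0:
            return False
    return True


def is_strict_local_min_bruteforce(J, theta, x):
    """Same test, comparing energies of all Hamming-distance-1 neighbors."""
    e_x = energy(J, theta, x)
    for i in range(len(x)):
        y = list(x)
        y[i] = 1 - y[i]  # flip bit i
        if energy(J, theta, y) <= e_x:
            return False
    return True


# Hypothetical symmetric weights with zero diagonal, and zero thresholds.
J = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
theta = [0, 0, 0]

for x in ([1, 1, 0], [0, 0, 0]):
    print(x, is_strict_local_min(J, theta, x), is_strict_local_min_bruteforce(J, theta, x))
```

In this example (1, 1, 0) is a strict local minimum, whereas (0, 0, 0) is not: flipping its first bit leaves the energy unchanged. With zero thresholds the all-zero state is nevertheless a fixed-point of (1), so the implication stated above does not reverse in general.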

A basic problem is to construct Hopfield networks with
a given set $\mathcal{D}$ of binary patterns as fixed-points or strict
local minima of the energy function [2]. Such networks
are useful for memory denoising and retrieval since corrupted
versions of patterns in $\mathcal{D}$ will converge through
the dynamics to the originals. Traditional approaches



to this problem consist of iterating over $\mathcal{D}$ a learning
rule [6] that updates a network's weights and thresholds
given a training pattern $x \in \mathcal{D}$. We call a rule local
when the learning updates to the three parameters $J_{ij}$,
$\theta_i$, and $\theta_j$ can be computed with access solely to $x_i$, $x_j$,
the feedforward inputs $J_i x$, $J_j x$, and the thresholds $\theta_i$,
$\theta_j$; otherwise, we call the rule nonlocal. The locality of
a rule is an important feature in a network training algorithm
because of its necessity in theoretical models of
computation in neuroscience.

In [2], Hopfield defined an outer-product learning rule
(OPR) for finding such networks. OPR is a local rule
since only the binary states of nodes $x_i$ and $x_j$ are required
to update a coupling term during training (and
only the state of $x_i$ is required to update $\theta_i$). Using OPR,
at most $n/(4 \log n)$ patterns can be stored without errors
in an $n$-node Hopfield network [7, 8]. In particular,
the ratio of patterns storable to the number of nodes using
this rule is at most $1/(4 \log n)$ memories per neuron,
which approaches zero as $n$ increases. If a small percentage
of incorrect bits is tolerated, then approximately
$0.15n$ patterns can be stored [2, 9].
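
For concreteness, a minimal sketch of an outer-product rule follows. It assumes the form used in Hopfield's original paper for 0/1 states, $J_{ij} = \sum_x (2x_i - 1)(2x_j - 1)$ for $i \ne j$ with zero thresholds; the exact OPR variant used in the experiments here (in particular any threshold update) may differ. The helper `opr_capacity_bound` assumes the natural logarithm in $n/(4 \log n)$, and all function names are ours.

```python
import math


def outer_product_rule(patterns):
    """Sum of outer products of the +/-1 versions of the patterns, zero diagonal."""
    n = len(patterns[0])
    J = [[0] * n for _ in range(n)]
    for x in patterns:
        s = [2 * b - 1 for b in x]  # map {0, 1} to {-1, +1}
        for i in range(n):
            for j in range(n):
                if i != j:
                    J[i][j] += s[i] * s[j]
    return J, [0] * n  # thresholds are left at zero in this sketch


def is_fixed_point(J, theta, x):
    """True if x_i = H(J_i x - theta_i) for every node i."""
    n = len(x)
    return all(
        x[i] == (1 if sum(J[i][j] * x[j] for j in range(n)) - theta[i] > 0 else 0)
        for i in range(n)
    )


def opr_capacity_bound(n):
    """n / (4 log n), assuming the natural logarithm."""
    return n / (4 * math.log(n))


J, theta = outer_product_rule([[1, 1, 0, 0]])
print(is_fixed_point(J, theta, [1, 1, 0, 0]))
print(opr_capacity_bound(64))
```

Each coupling is built from the states of its two endpoints only, which is the sense in which the rule is local. Under the natural-logarithm assumption, the bound for $n = 64$ is fewer than four patterns.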

The perceptron learning rule (PER) [10, 11] provides
an alternative method to store patterns in a Hopfield
network [TJ]. PER is also a local rule since updating $J_{ij}$
requires only $J_i x$ and $J_j x$ (and updating $\theta_i$ requires $J_i x$).
Unlike OPR, it achieves optimal storage capacity, in that
if it is possible for a collection of patterns $\mathcal{D}$ to be fixed-points
of a Hopfield network, then PER will converge to
parameters $J$, $\theta$ for which all of $\mathcal{D}$ are fixed-points. However,
training frequently takes many parameter update
steps (see Fig. 4), and the resulting Hopfield networks do
not generalize well (see Fig. 5) nor store patterns from
corrupted samples (see Fig. 6).
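
A perceptron-style rule of this kind might look as follows. This is a sketch under our own assumptions: each node is treated as a perceptron predicting its own bit, and the weight change is symmetrized so that $J$ stays symmetric with zero diagonal. The precise update used for PER in the experiments is not specified in the text, and the convergence guarantee quoted above should not be assumed to carry over to this variant.

```python
def heaviside(r):
    return 1 if r > 0 else 0


def outputs(J, theta, x):
    """y_i = H(J_i x - theta_i) for every node i."""
    n = len(x)
    return [heaviside(sum(J[i][j] * x[j] for j in range(n)) - theta[i])
            for i in range(n)]


def perceptron_rule(patterns, max_epochs=1000):
    """Cycle through the patterns, correcting every node whose output is wrong."""
    n = len(patterns[0])
    J = [[0] * n for _ in range(n)]
    theta = [0] * n
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x in patterns:
            y = outputs(J, theta, x)
            err = [xi - yi for xi, yi in zip(x, y)]  # each entry in {-1, 0, +1}
            mistakes += sum(abs(e) for e in err)
            for i in range(n):
                theta[i] -= err[i]
                for j in range(i + 1, n):
                    step = err[i] * x[j] + err[j] * x[i]  # symmetrized update
                    J[i][j] += step
                    J[j][i] += step
        if mistakes == 0:  # every pattern is now a fixed-point
            return J, theta, epoch
    raise RuntimeError("did not converge within max_epochs")


patterns = [[1, 1, 0, 0], [0, 0, 1, 1]]
J, theta, epochs = perceptron_rule(patterns)
print(J, theta, epochs)
```

The update for $J_{ij}$ uses only $x_i$, $x_j$ and the outputs of nodes $i$ and $j$, which depend on $J_i x$, $J_j x$ and the two thresholds, so the sketch is local in the sense defined earlier.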

Despite the connection to the Ising model energy func- 
tion, and the common usage of Ising spin glasses (oth- 
erwise referred to as Boltzmann machines [1]) to build 
probabilistic models of binary data, we are aware of no 
existing work on associative memories that takes ad- 
vantage of a probabilistic interpretation during training. 
(Although probabilistic interpretations have been used 
for pattern recovery [15].) 

Theoretical Results. We give an efficient algorithm 
for storing at least n binary patterns as strict local min- 
ima (and thus fixed-points) in an n-node Hopfield net- 
work, and we prove that this algorithm achieves the op- 
timal storage capacity achievable in such a network. We 
also present a novel local learning rule for the training of 
neural networks. 

Consider a collection of $m$ binary $n$-bit patterns $\mathcal{D}$ to
be stored as strict local minima in a Hopfield network.
Not all collections of $m$ such patterns $\mathcal{D}$ can be so stored;
for instance, from (3) we see that no two binary patterns
one bit apart can be stored simultaneously. Nevertheless,
we say that the collection $\mathcal{D}$ can be stored as local minima
of a Hopfield network if there is some $\mathcal{H} = (J, \theta)$ such
that each $x \in \mathcal{D}$ is a strict local minimum of the energy
function $E_x(J, \theta)$ in (2).

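The minimum probability flow (MPF) objective function
[5] given the collection $\mathcal{D}$ is

$$K_{\mathcal{D}}(J, \theta) := \sum_{x \in \mathcal{D}} \sum_{x' \in \mathcal{N}(x)} \exp\left(\frac{E_x - E_{x'}}{2}\right). \tag{4}$$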

The MPF objective function (4) is infinitely differentiable and strictly
convex in the parameters. Notice that when $K_{\mathcal{D}}(J, \theta)$ is
small, the energy differences $E_x - E_{x'}$ between $x \in \mathcal{D}$
and patterns $x'$ in neighborhoods $\mathcal{N}(x)$ will satisfy (3),
making $x$ a fixed-point of the dynamics.
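
To make the objective concrete, the sketch below evaluates it by brute force, assuming the standard MPF form $K_{\mathcal{D}}(J, \theta) = \sum_{x \in \mathcal{D}} \sum_{x' \in \mathcal{N}(x)} \exp((E_x - E_{x'})/2)$ with single-bit-flip neighborhoods. The 3-node weights are a hypothetical example and the function names are ours.

```python
import math


def energy(J, theta, x):
    """Energy (2): -sum_{i<j} x_i x_j J_ij + sum_i theta_i x_i."""
    n = len(x)
    pairs = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -pairs + sum(t * xi for t, xi in zip(theta, x))


def mpf_objective(J, theta, patterns):
    """Sum of exp((E_x - E_x') / 2) over patterns x and one-bit neighbors x'."""
    total = 0.0
    for x in patterns:
        e_x = energy(J, theta, x)
        for i in range(len(x)):
            y = list(x)
            y[i] = 1 - y[i]  # neighbor at Hamming distance 1
            total += math.exp(0.5 * (e_x - energy(J, theta, y)))
    return total


# Hypothetical symmetric weights with zero diagonal, and zero thresholds.
J = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
theta = [0, 0, 0]
patterns = [[1, 1, 0]]

print(mpf_objective(J, theta, patterns))

# With all parameters zero, every energy difference vanishes and K = n * |D|.
zero_J = [[0] * 3 for _ in range(3)]
print(mpf_objective(zero_J, [0, 0, 0], patterns))
```

Each pattern contributes exactly $n$ terms, one per bit flip, so the cost of one evaluation grows with $n|\mathcal{D}|$ energy differences rather than with the $2^n$ states a partition function would require.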

As the following result explains, minimizing the MPF objective (4) given a
storable set of patterns will determine a Hopfield network
storing those patterns.

Theorem 1. If a set of binary vectors $\mathcal{D}$ can be stored
as local minima of a Hopfield network, then minimizing
the convex MPF objective will find such a network.

Proof: We first claim that $\mathcal{D}$ can be stored as local
minima of a Hopfield network $\mathcal{H}$ if and only if the MPF
objective (4) satisfies $K_{\mathcal{D}}(J, \theta) < 1$ for some $J$ and $\theta$.
Suppose first that $\mathcal{D}$ can be made strict local minima
with parameters $J$ and $\theta$. Then for each $x \in \mathcal{D}$ and
$x' \in \mathcal{N}(x)$, inequality (3) holds. In particular, a uniform
scaling in the parameters will make the energy differences
in (4) arbitrarily large and negative, and thus $K_{\mathcal{D}}$ can be
made less than 1. Conversely, suppose that $K_{\mathcal{D}}(J, \theta) < 1$
for some choice of $J$ and $\theta$. Then each term in the sum
of positive numbers (4) is less than 1. This implies that
the energy difference between each $x \in \mathcal{D}$ and $x' \in \mathcal{N}(x)$
satisfies (3). Thus, the patterns in $\mathcal{D}$ are all strict local minima.

We now explain how the claim proves the theorem.
Suppose that $\mathcal{D}$ can be stored as local minima of a Hopfield
network; then, $K_{\mathcal{D}}(J, \theta) < 1$ for some $J$, $\theta$. Any
method producing parameter values $J$ and $\theta$ having objective
(4) arbitrarily close to the infimum of $K_{\mathcal{D}}(J, \theta)$
will produce a network with MPF objective strictly less
than 1, and therefore store $\mathcal{D}$ by the claim above. □
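
The scaling step in the proof can be checked numerically. The sketch below takes a hypothetical 4-node network that stores two patterns as strict local minima and multiplies all parameters by a constant $c$. It assumes each term of the objective is $\exp((J_i x - \theta_i)\delta_i / 2)$, which is the energy difference (3) inserted into the standard MPF form; the function names are ours.

```python
import math


def mpf_objective(J, theta, patterns):
    """K_D written with local fields: sum of exp((J_i x - theta_i) * delta_i / 2)."""
    total = 0.0
    for x in patterns:
        n = len(x)
        for i in range(n):
            field = sum(J[i][j] * x[j] for j in range(n)) - theta[i]
            total += math.exp(0.5 * field * (1 - 2 * x[i]))
    return total


def scaled(J, theta, c):
    """Multiply every weight and threshold by the same constant c."""
    return [[c * v for v in row] for row in J], [c * t for t in theta]


# Hypothetical parameters for which both patterns satisfy inequality (3).
patterns = [[1, 1, 0, 0], [0, 0, 1, 1]]
J = [[0, 2, -1, -1],
     [2, 0, -1, -1],
     [-1, -1, 0, 2],
     [-1, -1, 2, 0]]
theta = [0, 0, -1, -1]

for c in (1, 2, 3, 10):
    Jc, thetac = scaled(J, theta, c)
    print(c, mpf_objective(Jc, thetac, patterns))
```

At $c = 1$ the objective exceeds 1 even though both patterns are stored, which shows that $K_{\mathcal{D}} < 1$ is sufficient but not necessary for a particular parameter choice; only after scaling does the objective drop below 1, as the proof requires.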

Our next main result is that at least $n$ patterns in an $n$-node
Hopfield network can be stored by minimizing (4).
To make this statement mathematically precise, we introduce
some notation. Let $r(m, n) \le 1$ be the probability
that a collection of $m$ binary patterns chosen uniformly
at random from all $\binom{2^n}{m}$ $m$-element subsets of $\{0, 1\}^n$ can
be made local minima of a Hopfield network. The pattern
capacity (per neuron) of the Hopfield network is defined
to be the supremum of all real numbers $\alpha > 0$ such that

$$\lim_{n \to \infty} r(\alpha n, n) = 1. \tag{5}$$

Theorem 2. The pattern capacity of an $n$-node Hopfield
network is at least 1 pattern per neuron.

In other words, for any fixed $\alpha < 1$, the fraction of
all subsets of $m = \alpha n$ patterns that can be made strict
local minima (and thus fixed-points) of a Hopfield network
with $n$ nodes converges to 1 as $n$ tends to infinity.

FIG. 3. Shows fraction of patterns made fixed-points of a
Hopfield network using OPR (outer-product rule), MPF (minimum
probability flow), and PER (perceptron) as a function
of the number of randomly generated training patterns $m$.
Here, $n = 64$ binary nodes and we have averaged over $t = 20$
trials. The slight difference in performance between MPF and
PER is due to the extraordinary number of iterations required
for PER to achieve perfect storage of patterns near the critical
pattern capacity of the Hopfield network. See also Fig. 4.

Moreover, by Theorem 1, such networks can be found
by minimizing (4). Experimental evidence suggests that
the limit in (5) is 1 for all $\alpha < 1.5$, but converges to 0
for $\alpha > 1.7$ (see Fig. 3). Although the Cover bound [TJ]
forces $\alpha \le 2$, it is an open problem to determine the exact
critical value of $\alpha$ (i.e., the exact pattern capacity of
the Hopfield network).

We close this section by defining a new learning rule
for a neural network. In words, the minimum probability
flow learning rule (MPF) takes an input training pattern
$x$ and moves the parameters $(J, \theta)$ a small amount
in the direction of steepest descent of the MPF objective
function $K_{\mathcal{D}}(J, \theta)$ with $\mathcal{D} = \{x\}$. Mathematically,
these updates for $J_{ij}$ and $\theta_i$ take the form (where again,
$\delta = 1 - 2x$):

$$\Delta J_{ij} \propto -\delta_i x_j e^{\frac{1}{2}(J_i x - \theta_i)\delta_i} - \delta_j x_i e^{\frac{1}{2}(J_j x - \theta_j)\delta_j}, \tag{6}$$
$$\Delta \theta_i \propto \delta_i e^{\frac{1}{2}(J_i x - \theta_i)\delta_i}. \tag{7}$$

It is clear from (6), (7) that MPF is a local learning rule.
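
A minimal online implementation of this rule is sketched below. It assumes the updates have the form $\Delta J_{ij} \propto -\delta_i x_j w_i - \delta_j x_i w_j$ and $\Delta \theta_i \propto \delta_i w_i$ with $w_i = \exp((J_i x - \theta_i)\delta_i / 2)$, i.e. steepest descent on the single-pattern MPF objective. The learning rate, the number of passes, and the function names are arbitrary choices of ours, not values from the experiments.

```python
import math


def mpf_update(J, theta, x, lr=0.1):
    """One in-place step of the local MPF rule for training pattern x."""
    n = len(x)
    delta = [1 - 2 * xi for xi in x]
    # w_i depends only on node i's state, feedforward input and threshold.
    w = [math.exp(0.5 * (sum(J[i][j] * x[j] for j in range(n)) - theta[i]) * delta[i])
         for i in range(n)]
    for i in range(n):
        theta[i] += lr * delta[i] * w[i]
        for j in range(i + 1, n):
            step = -lr * (delta[i] * x[j] * w[i] + delta[j] * x[i] * w[j])
            J[i][j] += step  # keep J symmetric with zero diagonal
            J[j][i] += step


def mpf_objective(J, theta, patterns):
    """Sum over patterns and bits of exp((J_i x - theta_i) * delta_i / 2)."""
    total = 0.0
    for x in patterns:
        n = len(x)
        for i in range(n):
            field = sum(J[i][j] * x[j] for j in range(n)) - theta[i]
            total += math.exp(0.5 * field * (1 - 2 * x[i]))
    return total


def is_strict_local_min(J, theta, x):
    """Check (J_i x - theta_i) * delta_i < 0 for every bit i."""
    n = len(x)
    return all(
        (sum(J[i][j] * x[j] for j in range(n)) - theta[i]) * (1 - 2 * x[i]) < 0
        for i in range(n)
    )


patterns = [[1, 1, 0, 0], [0, 0, 1, 1]]
n = 4
J = [[0.0] * n for _ in range(n)]
theta = [0.0] * n
for _ in range(2000):  # repeated presentation of the training patterns
    for x in patterns:
        mpf_update(J, theta, x)

print(mpf_objective(J, theta, patterns))
print([is_strict_local_min(J, theta, x) for x in patterns])
```

The update for a coupling touches only the two nodes it connects, so no global quantity (such as a partition function or the full energy) is ever computed during training.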

Experimental results. We performed several experiments
comparing standard techniques for fitting Hopfield
networks with minimizing the MPF objective function
(4). All computations were performed on standard desktop
computers, and we used the limited-memory
Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm
[T5] to minimize (4).

In our first experiment, we compared MPF to the two
methods OPR and PER for finding 64-node Hopfield networks
storing a given set of patterns $\mathcal{D}$. For each of 20

FIG. 4. Shows time (on a log scale) to train a Hopfield network
with $n = 64$ neurons to store $m$ patterns using OPR,
PER, and MPF (averaged over $t = 20$ trials).

FIG. 5. Shows fraction of exact pattern recovery for a perfectly
trained $n = 128$ Hopfield network using rules PER (figure
on the left) and MPF (figure on the right) as a function
of bit corruption at start of recovery dynamics for various
numbers $m$ of patterns to store. We remark that this experiment
and the next do not involve OPR as its performance
was significantly worse than that of MPF and PER.

trials, we used the three techniques to store a randomly
generated set of $m$ binary patterns, where $m$ ranged from
1 to 120. The results are displayed in Fig. 3 and support
the conclusions of Theorem 1 and Theorem 2.

To study the efficiency of our method, we compared
training time of a 64-node network as in Fig. 3 with the
three techniques OPR, MPF, and PER. The resulting
computation times are displayed in Fig. 4 on a logarithmic
scale. Notice that computation time for MPF
and PER significantly increases near the pattern capacity
threshold of the Hopfield network.

For our third experiment, we compared the denoising
performance of MPF and PER. For each of four values for
$m$ in a 128-node Hopfield network, we determined weights
and thresholds for storing all of a set of $m$ randomly generated
binary patterns using both MPF and PER. We
then flipped 0 to 64 of the bits in the stored patterns and
let the dynamics (1) converge (with weights and thresholds
given by MPF and PER), recording if the converged
pattern was identical to the original pattern or not. Our



FIG. 6. Shows fraction of patterns (shown in red for MPF and
blue for PER) and fraction of bits (shown in dotted red for
MPF and dotted blue for PER) recalled by trained networks
(with $n = 64$ nodes each) as a function of the number of patterns
$m$ to be stored. Training patterns were presented repeatedly
with 20 bit corruption (i.e., 31% of the bits flipped),
averaged over $t = 13$ trials.

results are shown in Fig. 5, and they demonstrate the superior
corrupted memory retrieval performance of MPF.
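
The recovery protocol of this experiment (flip a fixed number of bits, run the dynamics to convergence, compare with the original) can be sketched as follows. The 6-node network is a hypothetical stand-in built from a single pattern by an outer product, not an MPF- or PER-trained network from the experiments, and the function names, trial count, and seed are ours.

```python
import random


def converge(J, theta, x, max_sweeps=1000):
    """Run asynchronous updates in consecutive node order until nothing changes."""
    x = list(x)
    n = len(x)
    for _ in range(max_sweeps):
        prev = list(x)
        for i in range(n):
            field = sum(J[i][j] * x[j] for j in range(n)) - theta[i]
            x[i] = 1 if field > 0 else 0
        if x == prev:
            return x
    raise RuntimeError("no fixed-point reached within max_sweeps")


def corrupt(x, k, rng):
    """Flip exactly k randomly chosen bits of x."""
    y = list(x)
    for i in rng.sample(range(len(x)), k):
        y[i] = 1 - y[i]
    return y


def recovery_rate(J, theta, patterns, k, trials=50, seed=0):
    """Fraction of trials in which a k-bit corruption converges to the original."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.choice(patterns)
        if converge(J, theta, corrupt(x, k, rng)) == list(x):
            hits += 1
    return hits / trials


# Hypothetical stand-in network storing one pattern (outer product of +/-1 states).
pattern = [1, 1, 1, 0, 0, 0]
s = [2 * b - 1 for b in pattern]
J = [[0 if i == j else s[i] * s[j] for j in range(6)] for i in range(6)]
theta = [0] * 6

for k in (0, 1):
    print(k, recovery_rate(J, theta, [pattern], k))
```

Sweeping `k` over a range of corruption levels and plotting `recovery_rate` against it gives a curve of the kind reported in Fig. 5 for a given trained network.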

A surprising final finding in our investigation was that
MPF can store patterns from highly corrupted or noisy
versions on its own and without supervision. This result
is explained in Fig. 6. To illustrate the experiment visually,
we stored $m = 80$ binary fingerprints in a 4096-node
Hopfield network using a large set of training samples
which were corrupted by flipping at random 30% of the
original bits; see Fig. 2 for more details.

Discussion. We have presented a novel technique for 
the storage of patterns in a Hopfield associative mem- 
ory. The first step of the method is to fit an Ising model 
using minimum probability flow learning to a discrete 
distribution supported equally on a set of binary target 
patterns. Next, we use the learned Ising model param- 
eters to define a Hopfield network. We show that when 
the set of target patterns is storable, these steps result 
in a Hopfield network that stores all of the patterns as 
fixed-points. We have also demonstrated that the result- 
ing (convex) algorithm outperforms current techniques 
for training Hopfield networks. 

We have shown improved recovery of memories from 
noisy patterns and improved training speed as compared 
to training by PER. We have demonstrated optimal stor- 
age capacity in the noiseless case, outperforming OPR. 
We have also demonstrated the unsupervised storage of 
memories from heavily corrupted training data. Further- 
more, the learning rule that results from our method is 
local; that is, updating the weights between two units 
requires only their states and feedforward input. 

As MPF allows the fitting of large Hopfield networks
quickly, new investigations into the structure of Hopfield
networks are possible [16]. It is our hope that the robustness
and speed of this learning technique will enable
practical use of Hopfield associative memories in
computational neuroscience, computer science, and scientific
modeling.